AI-ASSISTED DIGITALISATION OF HISTORICAL DOCUMENTS
Abstract. Preserving historical archival heritage involves not only physical measures to safeguard these valuable texts but also providing for their digital preservation. However, merely digitising manuscripts and codexes is not enough. A further step is needed: the digitalisation of the content, i.e. the verbatim transcription of the scanned texts. This process enables the accurate preservation of the textual content, making it easier to search for information and conduct analyses. With the help of artificial intelligence, particularly Deep Neural Networks (DNNs), automatic handwriting recognition can be performed. In this study, we employed a Convolutional Recurrent Neural Network (CRNN), an established type of DNN, to determine the minimum amount of labelled data required to automatically transcribe five different datasets that vary in language and time period. The results show that a Character Error Rate (CER) lower than 10% can be achieved with just a few hundred text lines in almost all cases.
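The abstract reports results as a Character Error Rate but does not spell out how it is computed. The sketch below assumes the common definition, namely the character-level Levenshtein (edit) distance between the recognised text and the reference transcription, summed over all lines and divided by the total number of reference characters. The function names `edit_distance` and `character_error_rate` and the sample lines are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions and substitutions needed to turn hyp into ref."""
    prev = list(range(len(hyp) + 1))  # distances for the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # character missing from hyp
                curr[j - 1] + 1,     # extra character in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(references: Sequence[str],
                         hypotheses: Sequence[str]) -> float:
    """Total edit distance divided by total reference length (assumed definition)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references must contain at least one character")
    total_edits = sum(edit_distance(r, h)
                      for r, h in zip(references, hypotheses))
    return total_edits / total_chars


if __name__ == "__main__":
    # Hypothetical text lines: one wrong character in 20 reference characters.
    refs = ["anno domini", "in nomine"]
    hyps = ["anno domimi", "in nomine"]
    print(character_error_rate(refs, hyps))  # 0.05, i.e. a CER of 5%
```

Under this definition, a CER below 10% means fewer than one character edit per ten reference characters on average. Summing over lines before dividing weights long lines more heavily than averaging per-line rates would; which of the two the study used is not stated in the abstract.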
Similar resources
Flexible Computer Assisted Transcription of Historical Documents Through Subword Spotting
In the absence of accurate handwriting recognition for historical documents, computer assisted transcription (CAT) methods move into the spotlight. We explore some of the weaknesses of current CAT systems and propose a CAT system which relies on subword spotting and overcomes most of these. The system is ideal for crowdsourcing transcription to mobile users.
Historical Documents Modernization
Historical documents are mostly accessible to scholars specialized in the period in which the document originated. In order to increase their accessibility to a broader audience and help in the preservation of the cultural heritage, we propose a method to modernize these documents. This method is based on statistical machine translation, and aims at translating historical documents into a mode...
Unsupervised Transcription of Historical Documents
We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our system substantially...
Exploiting Collection Level for Improving Assisted Handwritten Words Transcription of Historical Documents
Transcription of handwritten words in historical documents is still a difficult task. When processing a huge number of pages, document-centered approaches are limited by the trade-off between automatic recognition errors and the tedium of human annotation work. In this article, we investigate the use of inter-page dependencies to overcome those limitations. For this, we propose a new...
Mining dates from historical documents
The essential quality of information in a digital library is accessibility. Full text search is not enough for some collections; more can be done. Historical collections, for example, contain dates, and it would be useful to historians to be able to search by them. However, these dates occur anywhere within the text of historical documents, and to be searched they must be extracted from the doc...
Journal
Journal title: The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Year: 2023
ISSN: 1682-1777, 1682-1750, 2194-9034
DOI: https://doi.org/10.5194/isprs-archives-xlviii-m-2-2023-557-2023